The submission includes the following.
For the project, we use the following packages:
rm(list = ls())
library(cmdstanr)
Warning: package 'cmdstanr' was built under R version 4.4.3
library(dplyr)
library(splines)
library(janitor)
library(readr)
library(brms)
library(ggplot2)
library(tidyr)
library(priorsense)
options(mc.cores = parallel::detectCores()) # parallelize if possible
options(brms.file_refit = "on_change") # save the files if the model has changed
ggplot2::theme_set(ggplot2::theme_light()) # nicer theme
Select a dataset with clusters such as schools, regions, or people with multiple observations per individual (from, for example, https://www.kaggle.com/). It is a good idea to choose a smallish dataset (not too many rows, e.g., fewer than 1000) or to subset it so that fitting the models doesn’t take too long.
The original dataset is the Real Estate Sales 2001–2022 dataset, a record of residential property transactions over twenty-two years (Masidekabakci, 2024). It includes raw sale prices (Sale.Amount), tax-assessed values (Assessed.Value), transaction dates, and various property attributes (e.g., location, size, year built). The data is grouped by time periods (e.g., months or years) and location coordinates. It was created for educational purposes, for real estate analysis and for exploring patterns in property sales over time, and is mostly useful for regression tasks predicting sale amounts. It is not peer-reviewed and no major papers have used this exact compilation, but similar datasets appear in the literature and in Kaggle competitions. It was pulled from Kaggle: https://www.kaggle.com/code/masidekabakci/real-estate-sales-2001-2022.
Our preprocessing steps include parsing the date field, handling or dropping missing values, converting categorical columns to factors, and (for modeling) log-transforming highly skewed columns like ‘sale_amount’ and ‘assessed_value’. We also split the ‘Location’ column into two columns, ‘lon’ and ‘lat’, for the spatial GP model. Furthermore, we plotted a histogram of the (log) sale amount. Lastly, we removed outliers outside 1.5 times the interquartile range.
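The Location split itself is not shown in the code below. A minimal sketch of how it could be done, assuming the raw column stores WKT-style strings such as "POINT (-72.47379 41.78533)" (an assumption about the source format; adjust the regex if the raw data differs):

```r
# Hypothetical helper: parse "POINT (lon lat)" strings into numeric columns.
# The input format is an assumption, not confirmed from the raw file.
parse_location <- function(loc) {
  nums <- regmatches(loc, gregexpr("-?[0-9]+\\.?[0-9]*", loc))
  data.frame(
    lon = as.numeric(sapply(nums, `[`, 1)),
    lat = as.numeric(sapply(nums, `[`, 2))
  )
}

parse_location("POINT (-72.47379 41.78533)")  # lon = -72.47379, lat = 41.78533
```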
# load the data and preprocessing steps go here
getwd() # e.g. "/Users/niyaneykova/Desktop/Bayesian Modeling"
[1] "/Users/niyaneykova/Desktop/Bayesian Modeling"
# read the data (absolute path to the project's data directory)
real_estate_dataset <- read.csv("/Users/niyaneykova/Desktop/Bayesian Modeling/data/real_estate_dataset.csv")
head(real_estate_dataset)
List.Year Town Assessed.Value Sale.Amount Property.Type Residential.Type Non.Use.Code
1 2022 48 207700 400000 3 3 11
2 2022 38 49700 29900 3 3 4
3 2022 59 208230 423261 3 3 11
4 2022 53 169050 351000 3 3 4
Date_Recorded date_ordinal DayOfWeek WeekOfYear Month MonthName Quarter Season
1 2023-09-20 19620 3 38 9 9 3 4
2 2023-02-27 19415 1 9 2 2 1 1
3 2022-10-31 19296 1 44 10 10 4 4
4 2023-03-30 19446 4 13 3 3 1 2
MonthInSeason DOW_num Month_sin Month_cos Week_sin Week_cos DOW_sin
1 1 3 -1.0000000 -1.836970e-16 -0.9927089 -1.205367e-01 0.4338837
2 3 1 0.8660254 5.000000e-01 0.8854560 4.647232e-01 0.7818315
3 2 1 -0.8660254 5.000000e-01 -0.8229839 5.680647e-01 0.7818315
4 1 4 1.0000000 6.123234e-17 1.0000000 -1.608123e-16 -0.4338837
DOW_cos lon lat spatial_cluster
1 -0.9009689 -72.47379 41.78533 3
2 0.6234898 -71.87593 41.57067 16
3 0.6234898 -73.40230 41.51023 27
4 -0.9009689 -72.10181 41.43314 4
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
Realestate <- real_estate_dataset
Realestate <- Realestate %>%
mutate(
log_sale = log(`Sale.Amount`),
log_assessed1 = log(`Assessed.Value` + 1)
)
# Select the most useful variables
Realestate_clean <- Realestate %>%
select(`List.Year`, Town, `Sale.Amount`, `Property.Type`, `Residential.Type`, `Assessed.Value`, Season, Non.Use.Code, lon, lat, log_sale, log_assessed1)
# Drop all rows containing missing values and values of zero
Realestate_clean <- Realestate_clean %>%
drop_na() %>%
filter(if_all(everything(), ~ . != 0))
# Inspect the levels of some variables
unique(Realestate_clean$Town)
[1] 48 38 59 53 9 10 39 32 15 45 26 34 52 22 7 16 41 5 36 25 27
[22] 1 46 3 12 20 4 2 23 30 11 37 64 84 73 85 57 51 66 71 94 106
[43] 102 92 108 103 13 43 18 47 55 19 14 54 28 35 33 76 89 77 86 75 90
[64] 91 87 98 101 56 49 8 60 40 82 65 99 105 63 72 81 97 70 100 95 67
[85] 88 96 68 104 74 29 31 42 44 107 50 58 78 93 80 24 83 61 21 17 6
[106] 69 79
unique(Realestate_clean$`Property.Type`)
[1] 3 4 1 2 6 5
unique(Realestate_clean$`Residential.Type`)
[1] 3 4 1 5 2
str(Realestate_clean)
'data.frame': 557 obs. of 12 variables:
$ List.Year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
$ Town : int 48 38 59 53 9 38 53 10 39 59 ...
$ Sale.Amount : num 400000 29900 423261 351000 3500000 ...
$ Property.Type : int 3 3 3 3 3 3 3 3 3 3 ...
$ Residential.Type: int 3 3 3 3 3 3 3 4 3 1 ...
$ Assessed.Value : num 207700 49700 208230 169050 769900 ...
$ Season : int 4 1 4 2 2 3 1 1 4 2 ...
$ Non.Use.Code : int 11 4 11 4 11 11 11 11 4 11 ...
$ lon : num -72.5 -71.9 -73.4 -72.1 -72.8 ...
$ lat : num 41.8 41.6 41.5 41.4 41.3 ...
$ log_sale : num 12.9 10.3 13 12.8 15.1 ...
$ log_assessed1 : num 12.2 10.8 12.2 12 13.6 ...
# Rename columns for use in brms
Realestate_clean <- clean_names(Realestate_clean)
# Randomly select 550 rows as a sample
Realestate_sample <- Realestate_clean %>%
sample_n(550)
# Convert integer-coded grouping variables to factors
Realestate_sample <- Realestate_sample %>%
mutate(
Town = as.factor(town),
`Property.Type` = as.factor(`property_type`),
`Residential.Type` = as.factor(`residential_type`)
)
# Log-transform numeric predictor variables
Realestate_sample <- Realestate_sample %>%
mutate(
`assessed_value` = log(`assessed_value` + 1),
`list_year` = log(`list_year` + 1)
)
# Remove outliers from the target variable
# Calculate Q1 and Q3
Q1 <- quantile(Realestate_sample$sale_amount, 0.25, na.rm = TRUE)
Q3 <- quantile(Realestate_sample$sale_amount, 0.75, na.rm = TRUE)
# Calculate IQR
IQR_value <- Q3 - Q1
# Define bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Filter dataset to remove outliers
Realestate_sample <- Realestate_sample[Realestate_sample$sale_amount >= lower_bound &
Realestate_sample$sale_amount <= upper_bound, ]
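The same IQR rule can be packaged as a small helper, equivalent to the steps above (a sketch, not part of the original code):

```r
# Returns TRUE for values inside [Q1 - k*IQR, Q3 + k*IQR]
iqr_keep <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x >= q[1] - k * diff(q) & x <= q[2] + k * diff(q)
}

# usage: Realestate_sample[iqr_keep(Realestate_sample$sale_amount), ]
```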
# Rescale the target variable by a factor of 1000 to give more manageable figures
Realestate_sample$sale_amount <- Realestate_sample$sale_amount / 1000
# Check the target variable
# Minimum value
min_value <- min(Realestate_sample$sale_amount, na.rm = TRUE)
# Maximum value
max_value <- max(Realestate_sample$sale_amount, na.rm = TRUE)
# Print results
cat("Minimum Sale Amount:", min_value, "\n")
Minimum Sale Amount: 4.5
cat("Maximum Sale Amount:", max_value, "\n")
Maximum Sale Amount: 765
# Basic histogram of the log-transformed sale amount
hist(Realestate_sample$log_sale,
  main = "Histogram of Log Sale Amount",
  xlab = "log(Sale Amount)",
col = "lightblue",
border = "black"
)
After pre-processing, our dataset has 518 observations and 15 variables remaining.
Our dependent variable is sale_amount (numeric), the price at which the property was sold. It has been divided by one thousand to make the numbers more manageable.
As independent variables we use the following columns:
1. list_year: the year in which the property was listed for sale, as a log-transformed number.
2. town (categorical): the town in which the property is located, a factor with 107 levels.
3. log_assessed1: the assessed value of the property before the sale, as a log-transformed number (numeric).
4. season: the season in which the property was sold, an integer from 1 to 4, treated as categorical.
5. non_use_code (categorical): an integer code indicating properties that may not be used for typical purposes (e.g., vacant land).
6. property_type and 7. residential_type: categorical variables describing the property’s classification and usage.
The numeric variables lon and lat capture the property’s geographic location.
Split the data into training (80%) and test (20%) sets. Transform the columns if necessary.
# Number of observations
n <- nrow(Realestate_sample)
# Create a random sample of 80% row indices
train_indices <- sample(seq_len(n), size = 0.8 * n)
# Split the data
train_data <- Realestate_sample[train_indices, ]
test_data <- Realestate_sample[-train_indices, ]
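Note that this split is not seeded, so rerunning the document yields a different partition. A reproducible variant (a sketch, using an arbitrary seed) would be:

```r
set.seed(123)  # arbitrary seed, for reproducibility only
train_indices <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- Realestate_sample[train_indices, ]
test_data  <- Realestate_sample[-train_indices, ]
```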
# models go here
# Calculating Gelman et al. priors.
intercept_prior <- mean(Realestate_sample$log_sale)
SD <- sd(Realestate_sample$log_sale)
slope_prior <- 2.5 * sd(Realestate_sample$log_sale)
SD_prior <- 1 / sd(Realestate_sample$log_sale)
intercept_prior
[1] 12.06991
SD
[1] 0.9424508
slope_prior
[1] 2.356127
SD_prior
[1] 1.061063
def_priors_s <- c(
prior(normal(12, 0.94), class = Intercept),
prior(normal(0, 2.33), class = b),
prior(exponential(1.07), class = sigma)
)
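Before fitting, these priors could be sanity-checked with a prior predictive simulation (not part of the original submission; `sample_prior = "only"` makes brms draw from the priors alone, so `pp_check` then shows what log sale prices the priors imply before seeing data):

```r
# Sketch of a prior predictive check; few iterations suffice since
# no posterior is being estimated.
prior_check <- brm(
  log_sale ~ list_year + log_assessed1 + season + non_use_code + (1 | Town),
  data = train_data,
  family = gaussian(),
  prior = def_priors_s,
  sample_prior = "only",
  chains = 2, iter = 1000, seed = 123
)
pp_check(prior_check, ndraws = 100)
```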
# Fitting a first model using partial pooling across towns.
Sale_Price_1 <- brm(log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 | Town)),
data = train_data,
family = gaussian(),
prior = def_priors_s,
iter = 4000,
chains = 4,
seed = 123,
file = "Sale_Price_1"
)
Sale_Price_1
Family: gaussian
Links: mu = identity; sigma = identity
Formula: log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 | Town))
Data: train_data (Number of observations: 414)
Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup draws = 8000
Multilevel Hyperparameters:
~Town (Number of levels: 99)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.20 0.05 0.11 0.30 1.00 2638 4745
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 2.35 18.02 -33.36 37.65 1.00 12711 5049
list_year 0.28 2.37 -4.39 4.97 1.00 12717 4984
log_assessed1 0.71 0.04 0.64 0.78 1.00 9819 6523
season 0.04 0.03 -0.01 0.10 1.00 12787 5814
non_use_code -0.09 0.01 -0.11 -0.08 1.00 10871 6074
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.58 0.02 0.54 0.62 1.00 7630 5874
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Rhat = 1.00 across all parameters shows that the chains converged, and Bulk_ESS and Tail_ESS are high, indicating reliable estimates. Bulk_ESS ranges from 2638 (sd(Intercept)) up to 12787 (season); Tail_ESS ranges from 4745 (sd(Intercept)) up to 6523 (log_assessed1).
The intercept represents the expected log sale price of 2.35 when all predictors are zero (because we have not centered them). There is a modest difference in log sale price across towns: the town-level standard deviation is 0.20 on the log scale, i.e., town baselines typically differ from the global average by about 22% (exp(0.20) ≈ 1.22).
The intercept and list_year have very large uncertainty. The model estimates a small effect for list_year, with a large standard error and a very wide 95% credible interval crossing 0, suggesting that list_year has no clear effect. The model assumes a linear effect of year, which may be inaccurate. Furthermore, predictor values of 0 are not meaningful in real-world settings, so this result is not interpretable for real scenarios.
# Fitting a second model, now also allowing the slope of assessed value to vary by town, because it is reasonable to assume that assessed values differ per town. We also add a horseshoe prior on the population-level slopes, since it is plausible that not all variables add information.
def_priors_hs <- c(
prior(normal(12, 0.94), class = Intercept),
prior(horseshoe(par_ratio = 0.2), class = b),
prior(exponential(1.07), class = sigma)
)
Sale_Price_2 <- brm(log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 + log_assessed1 | Town)),
data = train_data,
family = gaussian(),
prior = def_priors_hs,
iter = 4000,
control = list(max_treedepth = 12, adapt_delta = 0.999),
chains = 4,
seed = 123,
file = "Sale_Price_2"
)
Sale_Price_2
Warning: There were 2 divergent transitions after warmup. Increasing adapt_delta above
0.999 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Family: gaussian
Links: mu = identity; sigma = identity
Formula: log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 + log_assessed1 | Town))
Data: train_data (Number of observations: 414)
Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup draws = 8000
Multilevel Hyperparameters:
~Town (Number of levels: 99)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.33 0.56 2.36 4.54 1.00 2163 3411
sd(log_assessed1) 0.27 0.05 0.19 0.38 1.00 2188 3373
cor(Intercept,log_assessed1) -1.00 0.00 -1.00 -1.00 1.00 2340 3119
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 3.29 3.49 -3.60 8.68 1.00 6053 5649
list_year 0.02 0.45 -0.70 0.91 1.00 8317 5738
log_assessed1 0.80 0.06 0.68 0.91 1.00 3876 5038
season 0.02 0.02 -0.02 0.06 1.00 5237 7940
non_use_code -0.09 0.01 -0.10 -0.07 1.00 8116 5796
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.52 0.02 0.48 0.56 1.00 6358 5938
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
All R-hat values show good convergence at 1.00. In addition, Bulk ESS and Tail ESS are all sufficiently high.
The model intercept estimates the log sale price to be 3.29 (exp(3.29) ≈ $27) when all predictors (list_year, log_assessed1, season, and non_use_code) are zero, but with wide uncertainty. This estimate is not very meaningful in real-world terms. list_year and season show weak effects that may not be useful. There is stronger variability in baseline log sale prices across towns (3.33), much higher than in model 1, suggesting town differences are more pronounced now that the slopes also vary with log_assessed1.
It is good to keep in mind that since the intercept represents the expected log sale price when all predictors are at zero, and the predictors are not centered, these values are not meaningful in the context of the data (e.g., list_year = 0), making the intercept not directly interpretable in a real-world sense. We therefore need to center some predictors in a meaningful way, and for this reason we have centered the list_year variable.
# updated model 3
# loosening the priors for the intercept and slopes
def_priors_loose <- c(
prior(normal(12, 2), class = Intercept), # mean and sd from data
prior(normal(0, 3.5), class = b), # looser slope prior
prior(exponential(1.07), class = sigma)
)
# centering the (already log-transformed) list_year at its mean for correct interpretation
train_data <- train_data %>%
mutate(list_year_centered = list_year - mean(list_year, na.rm = TRUE))
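The centering above is applied to train_data only. For later prediction on the test set with model 3, the same transformation would be needed, using the training mean so the two scales match (a sketch, not in the original code):

```r
# center the test set with the TRAINING mean, not its own mean,
# so model 3 sees list_year_centered on a consistent scale
test_data <- test_data %>%
  mutate(list_year_centered = list_year - mean(train_data$list_year, na.rm = TRUE))
```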
# model 3
Sale_Price_3 <- brm(
formula = log_sale ~ list_year_centered + log_assessed1 + season + non_use_code +
(1 + log_assessed1 | town),
data = train_data,
family = gaussian(),
prior = def_priors_loose,
iter = 4000,
chains = 4,
seed = 123,
control = list(adapt_delta = 0.999, max_treedepth = 12),
file = "Sale_Price_3"
)
Sale_Price_3
Family: gaussian
Links: mu = identity; sigma = identity
Formula: log_sale ~ list_year_centered + log_assessed1 + season + non_use_code + (1 + log_assessed1 | town)
Data: train_data (Number of observations: 414)
Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup draws = 8000
Multilevel Hyperparameters:
~town (Number of levels: 99)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.33 0.56 2.35 4.54 1.00 3247 5312
sd(log_assessed1) 0.27 0.05 0.19 0.38 1.00 3319 5323
cor(Intercept,log_assessed1) -1.00 0.00 -1.00 -1.00 1.00 2635 4379
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 3.33 0.71 1.92 4.69 1.00 4987 5984
list_year_centered 0.85 3.41 -5.92 7.58 1.00 17337 5511
log_assessed1 0.80 0.06 0.69 0.92 1.00 5021 5786
season 0.03 0.03 -0.02 0.08 1.00 17376 6292
non_use_code -0.09 0.01 -0.11 -0.07 1.00 16226 6236
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.52 0.02 0.48 0.56 1.00 11046 5478
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
All R-hat values are at 1.00, indicating good convergence. Bulk ESS and tail ESS are all sufficiently high.
The expected log sale price is 3.33 for an average (centered) list year, with the other predictors at zero and an average town effect. This is much more reasonable than before, though only for the year predictor. There is still large variability in baseline log sale prices across towns (3.33), alongside a meaningful but small variation in the effect of log assessed value across towns (0.27). The residual sigma of 0.52 on the log scale indicates substantial remaining noise.
The intervals for list_year_centered and the intercept are the largest, with the former also crossing zero. This indicates there is still no clear effect of list_year_centered on the dependent variable.
# Traceplots and posterior densities of model 1
plot(Sale_Price_1, ask = FALSE)
Traceplots demonstrate good mixing across all four chains, indicating good convergence. The average deviation of the intercept across towns is about 0.20 log-sale units, i.e., roughly a 22% multiplicative difference between town baselines (exp(0.20) ≈ 1.22). Sigma is centered around 0.55–0.60, implying that individual house price predictions can be off by a multiplicative factor of roughly exp(0.58) ≈ 1.8. log_assessed1 and non_use_code are the only clearly estimated, meaningful predictors in this model, with tighter credible intervals.
plot(Sale_Price_2, ask = FALSE)
For model 2, the intercept and list_year are shrunk towards zero, as expected under the horseshoe prior; their traceplots do show some spikes away from the center.
log_assessed1: the histogram shows most values are > 0, with some spread. The traceplot shows active sampling across chains, not collapsed near zero. This predictor is important and not strongly shrunk; it contributes meaningfully to predicting log sale price.
season: the histogram shows values highly concentrated near 0. The traceplot is well mixed, suggesting good convergence.
non_use_code: the histogram also shows values near 0, though with slightly more spread than season. The traceplot shows a similar pattern.
Further traceplots show some parameters that are not well mixed, for example sdb_list_year.
plot(Sale_Price_3, ask = FALSE)
Model 3 shows mostly stable traceplots that improve on the two previous models. Only cor_town__Intercept__log_assessed1 shows a poorly mixed traceplot.
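To inspect that parameter in isolation, the posterior draws can be passed to bayesplot directly (a sketch; bayesplot ships alongside brms, and the parameter name follows the powerscale output naming):

```r
library(bayesplot)
# traceplot for just the problematic correlation parameter
mcmc_trace(
  as.array(Sale_Price_3),
  pars = "cor_town__Intercept__log_assessed1"
)
```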
Model 1: Our aim was to establish a baseline model. We selected independent variables from the dataset that, based on common sense reasoning, are expected to have the strongest predictive value. These include: list_year, assessed_value, season, non_use_code, and town (used as a grouping variable). Given the distribution of assessed_value, we applied a log transformation to both this variable and the dependent variable to bring them into a more manageable range. For the priors, we opted for weakly informative Gelman priors.
Model 2: For the second model we also allow the slope to vary with the assessed value, because it is reasonable to assume that there are differences in the assessed value per town. In addition, we added a horseshoe prior because it is plausible that not all variables add information. Due to warnings, we set max_treedepth to 12 and adapt_delta to 0.999.
Model 3: This final version is a multilevel model where the slope of assessed value varies per town, as in model 2. However, it improves upon model 2 by widening the intercept and slope priors to resolve the prior-data conflicts revealed by the sensitivity analysis of model 2, while still maintaining some regularization. The priors are set back to Gelman-style priors to avoid the over-regularization of model 2. It also centers the logged list_year feature at its mean rather than at 0, as in the previous models, making it more interpretable in real-life terms. The population-level fixed effects are the intercept, list_year_centered, log_assessed1, season, and non_use_code, which show how these predictors relate to log price. The group-level effects capture a random intercept and a random slope, allowing each town its own baseline log price and its own sensitivity to the assessed value. The priors are loosened enough to let the data speak more while remaining weakly informative: a normal(12, 2) prior on the intercept centers it at the mean log price with a ±2-unit deviation, a normal(0, 3.5) prior on all slopes allows moderately large effects while still shrinking excessive noise, and an exponential(1.07) prior on sigma matches the data’s residual variability. We again used a Gaussian likelihood for comparable model results, despite prices being positive and right-skewed (other distributions were omitted for comparability purposes).
# Prior sensitivity analysis of model 1
powerscale_sensitivity(Sale_Price_1)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood diagnosis
b_Intercept 0.089 0.033 potential strong prior / weak likelihood
b_list_year 0.090 0.037 potential strong prior / weak likelihood
b_log_assessed1 0.002 0.200 -
b_season 0.002 0.108 -
b_non_use_code 0.002 0.104 -
sd_Town__Intercept 0.003 0.504 -
sigma 0.005 0.380 -
Intercept 0.003 0.063 -
r_Town[1,Intercept] 0.003 0.076 -
r_Town[2,Intercept] 0.004 0.154 -
r_Town[3,Intercept] 0.002 0.140 -
r_Town[4,Intercept] 0.002 0.246 -
r_Town[5,Intercept] 0.004 0.181 -
r_Town[6,Intercept] 0.003 0.162 -
r_Town[7,Intercept] 0.003 0.236 -
r_Town[8,Intercept] 0.003 0.124 -
r_Town[10,Intercept] 0.002 0.050 -
r_Town[11,Intercept] 0.003 0.230 -
r_Town[12,Intercept] 0.003 0.442 -
r_Town[13,Intercept] 0.003 0.227 -
r_Town[14,Intercept] 0.005 0.288 -
r_Town[15,Intercept] 0.002 0.112 -
r_Town[16,Intercept] 0.002 0.167 -
r_Town[18,Intercept] 0.004 0.155 -
r_Town[19,Intercept] 0.004 0.174 -
r_Town[20,Intercept] 0.003 0.273 -
r_Town[21,Intercept] 0.003 0.192 -
r_Town[22,Intercept] 0.003 0.148 -
r_Town[23,Intercept] 0.003 0.089 -
r_Town[24,Intercept] 0.002 0.041 -
[ reached 'max' / getOption("max.print") -- omitted 77 rows ]
# Prior sensitivity analysis of model 2
powerscale_sensitivity(Sale_Price_2)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood
b_Intercept 0.075 0.027
b_list_year 0.079 0.024
b_log_assessed1 0.016 0.115
b_season 0.087 0.070
b_non_use_code 0.014 0.135
sd_Town__Intercept 0.050 0.245
sd_Town__log_assessed1 0.050 0.242
cor_Town__Intercept__log_assessed1 0.041 0.390
sigma 0.011 0.437
hs_global 1.452 0.357
hs_slab 1.103 0.193
sdb_list_year 0.628 0.068
sdb_log_assessed1 0.155 0.047
sdb_season 0.743 0.095
sdb_non_use_code 0.439 0.113
Intercept 0.008 0.037
r_Town[1,Intercept] 0.009 0.033
r_Town[2,Intercept] 0.017 0.116
r_Town[3,Intercept] 0.016 0.055
r_Town[4,Intercept] 0.018 0.060
r_Town[5,Intercept] 0.014 0.156
r_Town[6,Intercept] 0.012 0.064
r_Town[7,Intercept] 0.011 0.126
r_Town[8,Intercept] 0.009 0.042
r_Town[10,Intercept] 0.012 0.070
r_Town[11,Intercept] 0.018 0.153
r_Town[12,Intercept] 0.010 0.269
r_Town[13,Intercept] 0.011 0.050
r_Town[14,Intercept] 0.016 0.070
r_Town[15,Intercept] 0.012 0.033
diagnosis
potential strong prior / weak likelihood
potential strong prior / weak likelihood
-
potential prior-data conflict
-
-
potential prior-data conflict
-
-
potential prior-data conflict
potential prior-data conflict
potential prior-data conflict
potential strong prior / weak likelihood
potential prior-data conflict
potential prior-data conflict
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
[ reached 'max' / getOption("max.print") -- omitted 184 rows ]
# Prior sensitivity analysis of model 3
powerscale_sensitivity(Sale_Price_3)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
variable prior likelihood diagnosis
b_Intercept 0.011 0.101 -
b_list_year_centered 0.095 0.052 potential prior-data conflict
b_log_assessed1 0.011 0.098 -
b_season 0.003 0.103 -
b_non_use_code 0.004 0.134 -
sd_town__Intercept 0.046 0.234 -
sd_town__log_assessed1 0.046 0.229 -
cor_town__Intercept__log_assessed1 0.036 0.426 -
sigma 0.002 0.416 -
Intercept 0.002 0.052 -
r_town[1,Intercept] 0.005 0.052 -
r_town[2,Intercept] 0.009 0.090 -
r_town[3,Intercept] 0.007 0.052 -
r_town[4,Intercept] 0.007 0.060 -
r_town[5,Intercept] 0.011 0.127 -
r_town[6,Intercept] 0.007 0.030 -
r_town[7,Intercept] 0.009 0.102 -
r_town[8,Intercept] 0.008 0.042 -
r_town[10,Intercept] 0.008 0.065 -
r_town[11,Intercept] 0.012 0.098 -
r_town[12,Intercept] 0.014 0.285 -
r_town[13,Intercept] 0.007 0.045 -
r_town[14,Intercept] 0.009 0.090 -
r_town[15,Intercept] 0.009 0.078 -
r_town[16,Intercept] 0.007 0.033 -
r_town[18,Intercept] 0.007 0.038 -
r_town[19,Intercept] 0.008 0.097 -
r_town[20,Intercept] 0.008 0.166 -
r_town[21,Intercept] 0.009 0.066 -
r_town[22,Intercept] 0.007 0.038 -
[ reached 'max' / getOption("max.print") -- omitted 178 rows ]
Powerscale sensitivity analysis of model 1 flags a weak likelihood / overly strong prior for the intercept and b_list_year, suggesting these priors are too tight.
In model 2 we again see strong priors for the intercept and b_list_year, with additional prior-data conflicts for season, sd_Town__Intercept, and others, possibly due to over-regularization from the horseshoe prior on the slopes, which shrinks several effects.
In model 3 we reverted to Gelman-style priors, loosened for the intercept and slopes. This removed most of the prior-data conflicts; the only one remaining is for the b_list_year_centered variable.
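priorsense can also visualize how power-scaling moves the marginal posterior for that variable; a sketch, assuming the `powerscale_plot_dens()` function and `variable` argument of the installed priorsense version:

```r
# posterior density of b_list_year_centered under prior/likelihood power-scaling
powerscale_plot_dens(Sale_Price_3, variable = "b_list_year_centered")
```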
# Posterior predictive checks for model 1
pp_check(Sale_Price_1, type = "intervals", x = "log_sale", prob_outer = 0.95)
# Posterior predictive checks for model 1
pp_check(Sale_Price_1, ndraws = 200)
# Posterior predictive checks for model 1
pp_check(Sale_Price_1, type = "stat_2d")
# Posterior predictive checks for model 2
pp_check(Sale_Price_2, type = "intervals", x = "log_sale", prob_outer = 0.95)
# Posterior predictive checks for model 2
pp_check(Sale_Price_2, ndraws = 200)
# Posterior predictive checks for model 2
pp_check(Sale_Price_2, type = "stat_2d")
# Posterior predictive checks for model 3
pp_check(Sale_Price_3, type = "intervals", x = "log_sale", prob_outer = 0.95)
# Posterior predictive checks for model 3
pp_check(Sale_Price_3, ndraws = 200)
# Posterior predictive checks for model 3
pp_check(Sale_Price_3, type = "stat_2d")
Model 1: The posterior predictive checks show that the model does not give good predictions in the low region, possibly due to the lack of data there. The density increases with higher log_sale values, but the intervals are wide, indicating high uncertainty. The model is most confident, with tighter predictive intervals, in the mid region of the data. The density overlay plot shows the model captures the overall data shape quite well, though it sits a bit on the low side; the observed distribution (dark blue line) mostly lies within the predictive draws (light blue lines). The last plot shows that the model’s simulated distributions of log_sale closely match the observed data in both mean and standard deviation: the observed dark blue point lies within the cloud of simulated light blue points, indicating a good fit.
Model 2: The posterior predictive checks for model 2 give the same picture as for model 1, but with less uncertainty over the major part of the first plot, indicating a good fit.
Model 3: The posterior predictive checks show the same results for model 3 as for model 2.
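A grouped check (not included above) could show whether the misfit in the low region is concentrated in particular groups; a sketch using the grouped-statistic plot that brms forwards to bayesplot:

```r
# compare the observed mean of log_sale to predictive means within each season
pp_check(Sale_Price_3, type = "stat_grouped", group = "season", stat = "mean")
```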
set.seed(123)
k <- loo::kfold_split_random(K = 10, N = nrow(train_data))
# note: the fitted model objects are overwritten with kfold results here;
# the original fits remain cached on disk (e.g., "Sale_Price_2.rds") and one is re-read below
Sale_Price_1 <- kfold(Sale_Price_1, chains = 1, folds = k, save_fits = FALSE)
Sale_Price_2 <- kfold(Sale_Price_2, chains = 1, folds = k, save_fits = FALSE)
Sale_Price_3 <- kfold(Sale_Price_3, chains = 1, folds = k, save_fits = FALSE)
print(Sale_Price_1)
Based on 10-fold cross-validation.
Estimate SE
elpd_kfold -378.7 25.8
p_kfold 33.2 5.8
kfoldic 757.5 51.6
print(Sale_Price_2)
Based on 10-fold cross-validation.
Estimate SE
elpd_kfold -353.5 28.0
p_kfold 59.8 11.1
kfoldic 707.0 55.9
print(Sale_Price_3)
Based on 10-fold cross-validation.
Estimate SE
elpd_kfold -353.9 28.1
p_kfold 60.5 11.5
kfoldic 707.9 56.2
set.seed(123)
# Question 5
loo_compare(Sale_Price_1, Sale_Price_2, Sale_Price_3)
elpd_diff se_diff
Sale_Price_2 0.0 0.0
Sale_Price_3 -0.4 1.2
Sale_Price_1 -25.3 14.7
# Question 6
fit <- readRDS("Sale_Price_2.rds")
fx <- fixef(fit)[-1, ]
print(head(fx[order(abs(fx[, "Estimate"]), decreasing = TRUE), ], 5), digits = 3)
Estimate Est.Error Q2.5 Q97.5
log_assessed1 0.7952 0.05830 0.6847 0.9120
non_use_code -0.0869 0.00863 -0.1039 -0.0701
list_year 0.0224 0.45139 -0.6997 0.9090
season 0.0151 0.02102 -0.0181 0.0633
# Question 7
set.seed(123)
pred <- exp(colMeans(posterior_predict(fit, newdata = test_data, allow_new_levels = TRUE)))
obs <- exp(test_data$log_sale)
rmse <- sqrt(mean((pred - obs)^2))
mae <- mean(abs(pred - obs))
mape <- mean(abs((pred - obs) / obs)) * 100
cat("RMSE =", rmse, "\n")
RMSE = 123544.7
cat("MAE =", mae, "\n")
MAE = 83892.18
cat("MAPE =", mape, "%\n")
MAPE = 62.32736 %
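For context, the same metrics for a naive baseline (predicting every property at the training-set median price) can be computed with the objects defined above; a sketch:

```r
baseline <- exp(median(train_data$log_sale))  # constant prediction, in dollars
rmse_base <- sqrt(mean((baseline - obs)^2))
mape_base <- mean(abs((baseline - obs) / obs)) * 100
cat("Baseline RMSE =", rmse_base, " MAPE =", mape_base, "%\n")
```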
Determine the best model based on predictive accuracy and justify your decision.
If we compare elpd_diff and se_diff across the three models, Sale_Price_2 comes out best and serves as the baseline (elpd_diff and se_diff of zero). Sale_Price_1 has an elpd_diff of −25.3 and an se_diff of 14.7. The elpd_diff exceeds the rule-of-thumb threshold of 4, but the se_diff is large and the difference does not pass the 2 × SE rule (25.3 < 2 × 14.7). This indicates that Sale_Price_2 and Sale_Price_1 do not differ decisively.
Sale_Price_3 falls between the other two, with an elpd_diff of −0.4 and an se_diff of 1.2; here the difference is below both the 2 × se_diff threshold and the elpd threshold of 4. Overall, Sale_Price_2 is chosen, but only with weak evidence: it is not meaningfully better in predictive performance than the other models.
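The 2 × SE rule of thumb applied in this comparison can also be checked programmatically (a sketch using the kfold objects created above):

```r
comp <- loo_compare(Sale_Price_1, Sale_Price_2, Sale_Price_3)
# a difference is considered clear only when |elpd_diff| > 2 * se_diff
abs(comp[, "elpd_diff"]) > 2 * comp[, "se_diff"]
```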
Sale prices in this model rise almost proportionally with assessed value: a one-percent increase in the assessed figure translates into roughly a 0.8 percent increase in the expected sale price, so doubling the assessment more than doubles what a property is likely to fetch. By contrast, parcels classified as “Vacant” carry a clear penalty, selling for about eight percent less than comparable properties with the baseline use-code. The calendar year of listing hints at a two-percent annual upward drift in prices, but the wide uncertainty band around that estimate means the data do not rule out the possibility of no real trend at all. Seasonality shows an even smaller, statistically unclear effect, implying that when the property is listed within the year probably does not have much of an effect when taking other factors into account.
Report RMSE or other loss (or utility) function on the test set. (Transform it back if necessary).
We chose RMSE and MAE as loss functions. We first transformed the predictions from the log scale back to the original scale, to make the RMSE and MAE interpretable. The RMSE and MAE were 123544.7 and 83892.18 respectively. For the sake of interpretation, we also included MAPE to understand the relative magnitude of the error, which was 62.3%.
Jonathan Otterspeer: model 2, dataset pre-processing, output interpretation, and template code, 5,6,7
Niya: model 3, model checking, output interpretation, PDF file, references, questions 1a, 3b, 4a b
Johan Vriend: model 1, dataset selection, output interpretation, setting-up code template, 1b, 2, 3a,b
During the making of this assignment, ChatGPT was used to deepen the knowledge provided during this course. Example of a prompt used: “You are a statistician specialized in Bayesian statistics. Explain the exact workings of the horseshoe prior in Bayesian statistics in normal human language.” No AI tools were used for assessing and evaluating the models.
Masidekabakci. (2024, December 30). Real Estate sales 2001-2022. Kaggle. https://www.kaggle.com/code/masidekabakci/real-estate-sales-2001-2022